Key Issues in Vowel Based Splitting of Telugu Bigrams

Author

  • T. Kameswara Rao

Abstract

Splitting compound Telugu words into their components or root words is one of the important, tedious and still error-prone tasks of Natural Language Processing (NLP). Except in a few special cases, at least one vowel is involved in every Telugu conjunction. In the resulting compound, vowels are often repeated unchanged or converted into other vowels or consonants. This paper describes the issues involved in vowel-based splitting of a Telugu bigram into proper root words using the conjunction (‘sandhi’) rules of Telugu grammar for machine translation (MT).

Keywords: Telugu word splitting; vowel based splitting; compound word splitting; bigrams; trigrams; n-grams; NLP
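The idea of vowel-based splitting can be illustrated with a minimal Python sketch. It is not the author's method. It assumes a romanized spelling in which a capital letter marks a long vowel, a tiny hand-written reverse-rule table covering only three well-known vowel sandhi patterns (lengthening, a + i becoming E, and elision of a final u before a vowel), and a hypothetical lexicon of root words. The names REVERSE_RULES, LEXICON and candidate_splits are all invented for this illustration.

```python
# Illustrative sketch only: a toy reverse-sandhi splitter on romanized Telugu.
# Assumptions (not from the paper): capital letter = long vowel, a tiny rule
# table, and a hypothetical lexicon of known root words.

VOWELS = set("aAiIuUeEoO")

# Surface vowel -> possible (end of first word, start of second word) pairs.
REVERSE_RULES = {
    "A": [("a", "a"), ("a", "A")],  # lengthening, e.g. a + A -> A
    "E": [("a", "i")],              # vowel change, a + i -> E
    "O": [("a", "u")],              # vowel change, a + u -> O
}

# Hypothetical root-word lexicon used to filter the candidates.
LEXICON = {"rAma", "Alayam", "rAja", "indra", "vADu", "ekkaDa"}


def candidate_splits(compound, lexicon):
    """Return (first, second) pairs whose parts are both in the lexicon."""
    results = []
    for i, ch in enumerate(compound):
        # Case 1: the vowel at position i is the merged form of two vowels.
        for left_end, right_start in REVERSE_RULES.get(ch, []):
            left = compound[:i] + left_end
            right = right_start + compound[i + 1:]
            if left in lexicon and right in lexicon:
                results.append((left, right))
        # Case 2: a final 'u' of the first word was dropped before a vowel.
        if ch in VOWELS and i > 0 and compound[i - 1] not in VOWELS:
            left = compound[:i] + "u"
            right = compound[i:]
            if left in lexicon and right in lexicon:
                results.append((left, right))
    return results


print(candidate_splits("rAmAlayam", LEXICON))  # [('rAma', 'Alayam')]
print(candidate_splits("rAjEndra", LEXICON))   # [('rAja', 'indra')]
print(candidate_splits("vADekkaDa", LEXICON))  # [('vADu', 'ekkaDa')]
```

Without the lexicon check, every vowel position would yield several candidate splits, which is one reason vowel-based splitting is ambiguous in practice. A realistic system would need a far larger rule set and dictionary than this toy table.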


Similar resources

Telugu Bigram Splitting using Consonant-based and Phrase-based Splitting

Splitting is a conventional process in most Indian languages, following their grammar rules. It is called ‘pada vicchEdanam’ (a Sanskrit term for word splitting) and is widely used by most Indian languages. Splitting plays a key role in Machine Translation (MT), particularly when the source language (SL) is an Indian language. Though this splitting may not succeed completely in extra...


Wordform- and Class-based Prediction of the Components of German Nominal Compounds in an AAC System

In word prediction systems for augmentative and alternative communication (AAC), productive word-formation processes such as compounding pose a serious problem. We present a model that predicts German nominal compounds by splitting them into their modifier and head components, instead of trying to predict them as a whole. The model is improved further by the use of class-based modifier-head bigra...


Online Recognition of Handwritten Telugu Characters

A system for online recognition of handwritten Telugu script is presented. A handwritten character is constructed by executing a sequence of strokes. A structure- or shape-based representation of a stroke is used in which a stroke is represented as a string of shape features. Using this string representation, an unknown stroke is identified by comparing it with a database of strokes. A full chara...


Vowel Identification Using Piecewise Separation Technique

A simple method for computer recognition of Telugu speech sounds irrespective of speakers is described. A vocabulary consisting of 871 Telugu words containing the ten vowels (/a/, /a:/, /i/, /i:/, /u/, /u:/, /e/, /e:/, /o/ and /o:/) in consonant-vowel nucleus-consonant (CNC) combination and uttered by three informants was selected as the testing material. Formant frequencies F1, F2 and F3 of...


Classification and Identification of Telugu Aksharas using Moment Invariants and C4.5 Algorithm

Classifying and recognizing Telugu characters (aksharas) is a challenging task because of the variations in the script and the large number of characters. The complexity of the shape is a result of structural compositions involving vowels (V), consonants (C), consonants with vowel modifiers (CV) and consonant clusters (CCV). This paper presents a novel classification strategy for classifying ak...




Publication date: 2014